Red Wine Quality by Ghadah Alkhayat

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Introduction: The red wine quality data set contains 1599 observations and 13 variables. One variable (X) is a row index, eleven describe chemical characteristics of the wine, and the last is a quality rating that ranges from 3 to 8.
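The output above has the shape of R's dim(), str() and summary() calls. Here is a minimal sketch of that inspection step, using only the first three rows that str() prints. Several columns are left out for brevity. For the full data, the inline text would be replaced by a path to the CSV file, whose name is not given in this report.

```r
# First three rows, copied from the str() output above (subset of columns).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,pH,alcohol,quality
1,7.4,0.70,0.00,3.51,9.4,5
2,7.8,0.88,0.00,3.20,9.8,5
3,7.8,0.76,0.04,3.26,9.8,5"

# read.csv() accepts inline text; with a real file, pass its path instead.
wine_head <- read.csv(text = csv_text)

dim(wine_head)      # number of rows and columns
str(wine_head)      # type and first values of each column
summary(wine_head)  # min, quartiles, mean and max per column
```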

Univariate Plots Section

From the plot above, we can see that about 43% of the wines are rated 5 and about 40% are rated 6.
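Shares like these can be computed directly instead of being read off a bar chart. The sketch below uses only the ten quality values visible in the str() output, so its percentages do not match the full data set.

```r
# The ten quality values shown by str() above (not the full 1599 rows).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

counts <- table(quality)                      # observations per rating
shares <- round(100 * prop.table(counts), 1)  # percentage per rating
print(shares)
```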

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

High residual.sugar values appear to be unusual in this data set: the third quartile is 2.6 while the maximum is 15.5. Therefore, I limited the histogram to values up to 3, which makes the displayed distribution look closer to normal.
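One common way to make "unusually high" concrete is Tukey's fence, Q3 + 1.5 × IQR. This is a swapped-in rule of thumb, not necessarily the reasoning used for the cut at 3. The sketch applies it to the residual.sugar summary printed above.

```r
# Quartiles and maximum of residual.sugar, taken from the summary above.
q1 <- 1.9
q3 <- 2.6
max_value <- 15.5

iqr <- q3 - q1                 # interquartile range
upper_fence <- q3 + 1.5 * iqr  # Tukey's upper fence for flagging outliers

cat("Upper fence:", upper_fence, "\n")
cat("Maximum lies beyond the fence:", max_value > upper_fence, "\n")
```

Restricting the plotted range only changes what the histogram shows. The long right tail is still present in the data itself.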

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol histogram is skewed to the right, with a peak near 9.5%, so the data is not normally distributed. The median (10.2%) and the mean (10.42%) are both close to 10%. According to the link below, a higher alcohol level makes the wine taste dry, while wines under 12.5% tend to taste sweet. Therefore, the relationship between alcohol and residual.sugar should be explored in the next section (Bivariate Analysis). source: https://www.everwonderwine.com/blog/2017/1/14/is-there-a-relationship-between-a-wines-alcohol-level-and-its-sweetness

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH is approximately normally distributed. The median and the mean are both around 3.3. A higher pH means lower acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

density is approximately normally distributed. The median and the mean are both around 0.997.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates histogram is skewed to the right, so the data is not normally distributed. The median is 0.62 and the mean is 0.66.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile.acidity histogram is almost normally distributed, with the median and the mean around 0.52.
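The skewness statements above are based on reading histograms. As a rough numeric cross-check, the sketch below computes two simple indicators from the printed summaries: the gap between mean and median, and the quartile (Bowley) skewness, (Q3 + Q1 − 2 × median) / (Q3 − Q1). Both are swapped-in heuristics rather than part of the original analysis, and the quartile measure ignores the tails entirely.

```r
# Summary statistics copied from the output above.
five_num <- data.frame(
  variable = c("residual.sugar", "alcohol", "pH",
               "density", "sulphates", "volatile.acidity"),
  q1     = c(1.9,   9.5,   3.21,  0.9956, 0.55,   0.39),
  median = c(2.2,   10.2,  3.31,  0.9968, 0.62,   0.52),
  mean   = c(2.539, 10.42, 3.311, 0.9967, 0.6581, 0.5278),
  q3     = c(2.6,   11.1,  3.4,   0.9978, 0.73,   0.64)
)

# Positive values suggest right skew, values near zero suggest symmetry.
five_num$mean_minus_median <- with(five_num, mean - median)
five_num$bowley <- with(five_num, (q3 + q1 - 2 * median) / (q3 - q1))

# Sort so the most right-skewed variable (by quartiles) comes first.
print(five_num[order(-five_num$bowley), ])
```

On these numbers, sulphates, residual.sugar and alcohol come out positive, while pH, density and volatile.acidity stay close to zero. That agrees with the histogram descriptions.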

Univariate Analysis

What is the structure of your dataset?

1599 observations and 13 variables

What is/are the main feature(s) of interest in your dataset?

Acidity, tannin, alcohol and sweetness are the main traits that affect red wine quality. source: https://winefolly.com/review/understanding-acidity-in-wine/

Therefore, my exploration focuses on the pH, alcohol, density and residual.sugar variables to see their effects on the quality variable.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • Sulfur dioxide (SO2), recorded here as free.sulfur.dioxide and total.sulfur.dioxide, is used to preserve the flavor and freshness of wines. In the U.S., the allowed upper limit is 350 ppm.

https://www.quickanddirtytips.com/health-fitness/healthy-eating/myths-about-sulfites-and-wine

Did you create any new variables from existing variables in the dataset?

No

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • The alcohol histogram is skewed to the right, but I did not change it since most of the observations are close to the mean.
  • In residual.sugar, high levels are unusual: the third quartile is 2.6 while the maximum is 15.5. Therefore, I limited the histogram to values up to 3, which makes the displayed distribution look closer to normal.

Bivariate Plots Section

source: https://www.statmethods.net/graphs/scatterplot.html

Interesting correlations:

  • residual.sugar and density are strongly correlated.
  • quality is strongly correlated with alcohol, citric.acid, volatile.acidity and sulphates. The problem is that these variables are also correlated with each other: (1) volatile.acidity and sulphates are correlated; (2) citric.acid is correlated with sulphates.
  • alcohol and pH are strongly correlated.

Not interesting correlations:

  • As expected, free.sulfur.dioxide and total.sulfur.dioxide are strongly correlated, but I will not explore this relationship since the first is part of the second.
  • citric.acid is negatively correlated with pH and with volatile.acidity. Higher acidity means lower pH.
  • fixed.acidity is negatively correlated with pH and with volatile.acidity.
  • pH and chlorides are correlated.
  • total.sulfur.dioxide and alcohol are strongly correlated. The reason could be that alcoholic fermentation produces sulfites.
  • alcohol and density are negatively correlated. When the alcohol level increases, wine density decreases.
  • alcohol and chlorides are strongly correlated.
  • density is positively correlated with citric.acid and with fixed.acidity.
  • density is negatively correlated with pH.
  • sulphates and chlorides are correlated.
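Correlation statements like these are easier to compare when the coefficients are listed next to each other. Here is a minimal sketch with base R's cor(), using the ten rows visible in the str() output for a subset of columns. With so few rows the coefficients are only illustrative and will differ from those of the full data set.

```r
# First ten rows of selected columns, copied from the str() output above.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine_head, method = "pearson")

# Correlation of every other variable with quality, strongest positive first.
quality_cor <- cor_matrix["quality", colnames(cor_matrix) != "quality"]
print(round(sort(quality_cor, decreasing = TRUE), 2))
```

Printing the coefficients also helps when deciding whether a pair deserves the label "strongly correlated" or only "correlated".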

As expected, above a 12.5% alcohol level, residual.sugar decreases: the sweetness drops and the taste becomes dry.

residual.sugar and density are strongly correlated. The reason might be that sweeter wine has higher density.

Alcohol and quality are strongly correlated. Wines with an alcohol level between 11 and 13 clearly received the highest quality ranks.
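Because quality takes only a few integer values, the alcohol and quality relationship can also be summarised per rating. The sketch below computes the median alcohol level for each quality rank with tapply(), again on the ten rows shown by str(), so the numbers are illustrative only.

```r
# First ten alcohol and quality values from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median alcohol level within each quality rank.
median_alcohol <- tapply(alcohol, quality, median)
print(median_alcohol)
```

The same grouping can be drawn with `boxplot(alcohol ~ factor(quality))`, which shows the spread within each rank as well as the median.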

Wines with a citric.acid level between 0.25 and 0.5 got the highest quality ranks. Citric acid adds flavor to wine, which explains the positive correlation.

Volatile.acidity at too high a level can lead to an unpleasant, vinegar-like taste, which explains the negative correlation between volatile.acidity and quality.

Sulphates can contribute to sulfur dioxide, which in turn is used as a preservative and affects the wine's taste. This explains the positive correlation between sulphates and quality. The highest rank, 8, is associated with sulphates levels around 0.75.

source: https://winobrothers.com/2011/10/11/sulfur-dioxide-so2-in-wine/

Alcohol and pH are positively correlated because riper wines have higher alcohol content, lower acidity and higher pH. source: https://www.winespectator.com/drvinny/show/id/How-Does-pH-Affect-Alcohol-in-Wine

Bivariate Analysis

Acidity, tannin, alcohol and sweetness are the main traits that affect red wine quality. Therefore, I wanted to focus on the pH, alcohol, density and residual.sugar variables to see their effects on the quality variable. However, after calculating the correlations among the variables in the dataset, I found that quality is strongly correlated with alcohol, citric.acid, volatile.acidity and sulphates. I then focused on the relationship of these four variables with quality. In summary, I noticed the following:

  • Alcohol levels between 12 and 13 got the highest quality ranks.
  • citric.acid levels between 0.25 and 0.5 got the highest quality ranks, since citric acid adds flavor to wine.
  • volatile.acidity at too high a level can lead to an unpleasant, vinegar-like taste, which explains the negative correlation with quality.
  • Sulphates can contribute to sulfur dioxide, which in turn is used as a preservative and affects the wine's taste. This explains the positive correlation between sulphates and quality.

I also explored some relationships among other variables:

  1. Between alcohol and residual.sugar: above a 12.5% alcohol level, residual.sugar decreases, since the sweetness drops and the taste becomes dry.
  2. Between residual.sugar and density: the two are strongly correlated. The reason might be that sweeter wine has higher density.
  3. Between alcohol and pH: the two are positively correlated because riper wines have higher alcohol content, lower acidity and higher pH.

Multivariate Plots Section

As shown above, the quality rank increases with alcohol levels up to 13 and citric.acid up to 0.375. It seems that considering these two variables together lowers the upper limit for citric.acid from 0.5 to about 0.375.

As shown above, the quality rank increases with alcohol levels up to 13 and citric.acid up to 0.75.

As shown above, the quality rank decreases as volatile.acidity approaches 0.6 and increases with alcohol levels up to 13.

As shown above, the quality rank decreases at high volatile.acidity levels, around 0.6. When citric.acid is plotted separately, levels between 0.25 and 0.5 got the highest quality ranks. However, this trend is not clear in the above plot.

As shown above, the quality rank increases with sulphates levels between 0.5 and 0.75 and citric.acid levels above 0.25 and below 0.75. source: https://ggplot2.tidyverse.org/reference/geom_smooth.html
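The statements above relate quality to two predictors at once. A tabular counterpart to such plots is the mean quality within each combination of binned predictors. The sketch below does this for alcohol and citric.acid with cut() and tapply(). The bin edges are arbitrary assumptions chosen to fit the ten rows shown by str(), not thresholds taken from the plots.

```r
# First ten rows of three columns, copied from the str() output above.
alcohol     <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
quality     <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin each predictor into two ranges (edges are illustrative assumptions).
alcohol_bin <- cut(alcohol, breaks = c(8, 9.5, 11))
citric_bin  <- cut(citric.acid, breaks = c(0, 0.25, 1), include.lowest = TRUE)

# Mean quality per (alcohol bin, citric.acid bin) cell; empty cells are NA.
mean_quality <- tapply(quality, list(alcohol = alcohol_bin, citric = citric_bin), mean)
print(round(mean_quality, 2))
```

A cell with NA signals that no wine falls into that combination. This is worth checking before reading a trend off a smoothed line in a sparsely populated region.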

Multivariate Analysis


Final Plots and Summary

Plot One

Description One

Alcohol and quality are strongly correlated. Wines with an alcohol level between 11 and 13 clearly received the highest quality ranks.

Plot Two

Description Two

As shown above, the quality rank increases with alcohol levels up to 13 and citric.acid up to 0.375. It seems that considering these two variables together lowers the upper limit for citric.acid from 0.5 to about 0.375.

Plot Three

Description Three

As shown above, the quality rank decreases at high volatile.acidity levels, around 0.6. When citric.acid is plotted separately, levels between 0.25 and 0.5 got the highest quality ranks. However, this trend is not clear in the above plot.

Reflection

The red wine quality data set contains 1599 observations and 13 variables. The quality variable is the dependent variable, and it ranges from 3 to 8. About 43% of the wines are rated 5 and about 40% are rated 6, which might influence the results since the dataset is imbalanced.
Acidity, tannin, alcohol and sweetness are the main traits that affect red wine quality. Therefore, I decided to explore pH, alcohol, density and residual.sugar to see the effects of these variables on the quality variable. However, I was surprised that quality is strongly correlated with volatile.acidity and sulphates.

Some of the interesting observations that I found are:

  1. High residual.sugar is unusual in the wine observations.
  2. A higher alcohol level makes the wine taste dry, while the taste would be sweet for levels under 12.5%.
  3. Sweeter wine has higher density.
  4. Wines with an alcohol level between 12 and 13 got the highest quality ranks.
  5. citric.acid levels between 0.25 and 0.5 got the highest quality ranks because citric acid adds flavor to wine.
  6. Volatile.acidity at too high a level can lead to an unpleasant, vinegar-like taste.
  7. Sulphates can contribute to sulfur dioxide, which in turn is used as a preservative and affects the wine's taste.
  8. Riper wines have higher alcohol content, lower acidity and higher pH.

Limitations

In addition to the imbalanced dataset issue mentioned above, multicollinearity exists between some variables: (1) volatile.acidity and sulphates are correlated; (2) citric.acid and sulphates are correlated.
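Multicollinearity among predictors is often quantified with the variance inflation factor, VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. The sketch below computes it with base R's lm() for the four predictors of interest. It is a swapped-in diagnostic, not part of the original analysis, and uses only the ten rows shown by str(), so the values are illustrative.

```r
# First ten rows of the four predictors, copied from the str() output above.
predictors <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Regress each predictor on the remaining ones and convert R^2 to a VIF.
vif <- sapply(names(predictors), function(target) {
  others <- setdiff(names(predictors), target)
  fit <- lm(reformulate(others, response = target), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. Larger values indicate that its coefficient in a regression would be estimated less precisely.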

Future Work

I plan to build a prediction model to predict the red wine quality rank based on four variables: volatile.acidity, citric.acid, alcohol and sulphates. I will use a classification algorithm, such as a decision tree. I believe that finding the right package and writing the code will be challenging. I might also try a logistic regression algorithm, but to do so, I have to change quality to a binary variable (high and low).
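The binary recoding needed for logistic regression can be sketched with base R's glm(). The cutoff of 7 for "high" quality is an assumption made for illustration. The model is fitted to the ten rows shown by str() with a single predictor, so it demonstrates the mechanics only and says nothing about real predictive performance.

```r
# First ten alcohol and quality values from the str() output above.
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Recode quality as binary: 1 = high, 0 = low (cutoff of 7 is an assumption).
wine_head$high_quality <- as.integer(wine_head$quality >= 7)

# Logistic regression of the binary outcome on alcohol.
fit <- glm(high_quality ~ alcohol, data = wine_head, family = binomial)

# Predicted probability of "high" for each wine.
predicted <- predict(fit, type = "response")
print(round(predicted, 2))
```

With the full data, the other three variables could be added to the right-hand side of the formula, and the model should be evaluated on held-out wines.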

References: https://bibinmjose.github.io/RedWineDataAnalysis/ + all sources listed above